-
Notifications
You must be signed in to change notification settings - Fork 3k
Flink: Project the RowData to remove meta-columns #3240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# Conflicts: # flink/src/main/java/org/apache/iceberg/flink/source/RowDataFileScanTaskReader.java # flink/src/test/java/org/apache/iceberg/flink/TestHelpers.java
flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
Outdated
Show resolved
Hide resolved
|
@rdblue @kbendick @openinx @stevenzwu Could you take a look of this again? 😄 |
flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
Show resolved
Hide resolved
| ) | ||
| )) | ||
| ); | ||
| AssertHelpers.assertThrows("Should be error because cannot project a partial nested list element.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: This is a little confusing for me at first.
Can we possibly rephrase this as Should not all users to project onto a subset of fields of a struct used in a list type? That would make what's being tested a bit more clear (at least for me) from the get go.
| boolean elementProjectable = !projectedList.elementType().isNestedType() || | ||
| projectedList.elementType().equals(originalList.elementType()); | ||
| Preconditions.checkArgument(elementProjectable, | ||
| "Cannot project a partial list element RowData. Trying to project %s out of %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See note below about this exception message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here I trying to keep this message same as StructLikeProjection. I feel this msg is ok, What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
Show resolved
Hide resolved
flink/src/test/java/org/apache/iceberg/flink/TestChangeLogTable.java
Outdated
Show resolved
Hide resolved
flink/src/test/java/org/apache/iceberg/flink/TestChangeLogTable.java
Outdated
Show resolved
Hide resolved
flink/src/test/java/org/apache/iceberg/flink/data/TestRowDataProjection.java
Show resolved
Hide resolved
| private static RowData.FieldGetter createFieldGetter(RowType rowType, | ||
| Types.StructType rowStruct, | ||
| Types.NestedField projectField) { | ||
| for (int i = 0; i < rowStruct.fields().size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this loop find essentially results in n^2 complexity. We can use this API from StructType.
public NestedField field(int id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we not only need to found the row field which field id equal to project field id, but also need to know the position of the match field. Even if we can get the match row field by StructType.field(int id), we also need to traverse the rowStruct to found out the field position again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Can we iterate through the schema once and set up the mapping btw field id and position id? I have a little performance concern of n^2 complexity for table with a lot of columns (like thousands or more).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1. There are tables with very high cardinality where this will potentially have a real performance impact. This tends to be especially true for base tables (raw ingested data events from clients etc), which often have very wide schemas and is also an area where Flink is pretty commonly used.
Anything that can be done to reduce this overhead would be great.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that is a great idea, let's do this~
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a little performance concern of n^2 complexity for table with a lot of columns (like thousands or more)
I'm fine with either. Because the complexity is actually n*m, let's say the n is the table's field number and m is the projection fields number. If both @stevenzwu and @kbendick think it's necessary to do, I'm okay with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a little performance concern of n^2 complexity for table with a lot of columns (like thousands or more)
I'm fine with either. Because the complexity is actually n*m, let's say the n is the table's field number and m is the projection fields number. If both @stevenzwu and @kbendick think it's necessary to do, I'm okay with it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I construct a fieldIdToPosition map and use StructType.field(int id) to find the row field. Now the complexity reduce to n, I think the performance will not be a problem.
|
@Reo-LEI Could you take a look for the checkstyle issue ? |
|
I'd like to take a look for this PR today, I think it's critical important feature for our flink users to read the v2 table. Thanks @Reo-LEI for picking up this PR ! |
Sure, I will fix this latter. |
| RowType nestedRowType = (RowType) rowType.getTypeAt(i); | ||
| int rowPos = i; | ||
| return row -> { | ||
| RowData nestedRow = row.isNullAt(rowPos) ? null : row.getRow(rowPos, nestedRowType.getFieldCount()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: If the nestedRow is null, do we still need to traverse the nested fields by using the RowDataProjection#project ? I think we can just return the null for the projection value ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a small patch for this:
diff --git a/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java b/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
index 9d1e8ea67..25a5b3ab3 100644
--- a/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
+++ b/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
@@ -45,7 +45,11 @@ public class RowDataProjection implements RowData {
* @return a wrapper to project rows
*/
public static RowDataProjection create(Schema schema, Schema projectedSchema) {
- return new RowDataProjection(FlinkSchemaUtil.convert(schema), schema.asStruct(), projectedSchema.asStruct());
+ return RowDataProjection.create(FlinkSchemaUtil.convert(schema), schema.asStruct(), projectedSchema.asStruct());
+ }
+
+ public static RowDataProjection create(RowType rowType, Types.StructType schema, Types.StructType projectedSchema) {
+ return new RowDataProjection(rowType, schema, projectedSchema);
}
private final RowData.FieldGetter[] getters;
@@ -73,9 +77,14 @@ public class RowDataProjection implements RowData {
RowType nestedRowType = (RowType) rowType.getTypeAt(i);
int rowPos = i;
return row -> {
- RowData nestedRow = row.isNullAt(rowPos) ? null : row.getRow(rowPos, nestedRowType.getFieldCount());
- return new RowDataProjection(nestedRowType, rowField.type().asStructType(),
- projectField.type().asStructType()).wrap(nestedRow);
+ if (row.isNullAt(rowPos)) {
+ return null;
+ } else {
+ RowData nestedRow = row.getRow(rowPos, nestedRowType.getFieldCount());
+ return RowDataProjection
+ .create(nestedRowType, rowField.type().asStructType(), projectField.type().asStructType())
+ .wrap(nestedRow);
+ }
};
case MAP:There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could not return null when the nestedRow is null. Because StructProjection will still project the nested struct even if the nested struct is null. If we return null here, the unittest will fail, because the expected record is not null but actual row data is null.
| Assert.assertTrue("expected Record and actual RowData should be both null or not null", |
|
|
||
| // Project the RowData to remove the extra meta columns. | ||
| if (!projectedSchema.sameSchema(deletes.requiredSchema())) { | ||
| RowDataProjection rowDataProjection = RowDataProjection.create(deletes.requiredSchema(), projectedSchema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see the RowDataProjection#create does a FlinkSchemaUtil.convert(schema) for the required schema to project, and I believe the FlinkDeleteFilter also did the same thing inside. I think we can reuse the converted flink row type between them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! Now I get the row type from FlinkDeleteFilter and pass it to RowDataProjection.
flink/src/main/java/org/apache/iceberg/flink/source/RowDataFileScanTaskReader.java
Show resolved
Hide resolved
| return this; | ||
| } | ||
|
|
||
| public Object getValue(int pos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this can be a private method, right ?
| } | ||
| } | ||
| } | ||
| throw new IllegalArgumentException(String.format("Cannot find field %s in %s", projectField, rowStruct)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think we need a more clear message for this exception: Cannot locate the project field <%s> in the iceberg struct <%s>
| for (int i = 0; i < rowStruct.fields().size(); i++) { | ||
| Types.NestedField rowField = rowStruct.fields().get(i); | ||
| if (rowField.fieldId() == projectField.fieldId()) { | ||
| Preconditions.checkArgument(rowField.type().typeId() == projectField.type().typeId(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: this can be simplified by the following lines as the Preconditions.checkArgument can format the error message directly.
Preconditions.checkArgument(rowField.type().typeId() == projectField.type().typeId(),
"Different iceberg type between row field <%s> and project field <%s>",
rowField, projectField);| RowType nestedRowType = (RowType) rowType.getTypeAt(i); | ||
| int rowPos = i; | ||
| return row -> { | ||
| RowData nestedRow = row.isNullAt(rowPos) ? null : row.getRow(rowPos, nestedRowType.getFieldCount()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a small patch for this:
diff --git a/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java b/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
index 9d1e8ea67..25a5b3ab3 100644
--- a/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
+++ b/flink/src/main/java/org/apache/iceberg/flink/data/RowDataProjection.java
@@ -45,7 +45,11 @@ public class RowDataProjection implements RowData {
* @return a wrapper to project rows
*/
public static RowDataProjection create(Schema schema, Schema projectedSchema) {
- return new RowDataProjection(FlinkSchemaUtil.convert(schema), schema.asStruct(), projectedSchema.asStruct());
+ return RowDataProjection.create(FlinkSchemaUtil.convert(schema), schema.asStruct(), projectedSchema.asStruct());
+ }
+
+ public static RowDataProjection create(RowType rowType, Types.StructType schema, Types.StructType projectedSchema) {
+ return new RowDataProjection(rowType, schema, projectedSchema);
}
private final RowData.FieldGetter[] getters;
@@ -73,9 +77,14 @@ public class RowDataProjection implements RowData {
RowType nestedRowType = (RowType) rowType.getTypeAt(i);
int rowPos = i;
return row -> {
- RowData nestedRow = row.isNullAt(rowPos) ? null : row.getRow(rowPos, nestedRowType.getFieldCount());
- return new RowDataProjection(nestedRowType, rowField.type().asStructType(),
- projectField.type().asStructType()).wrap(nestedRow);
+ if (row.isNullAt(rowPos)) {
+ return null;
+ } else {
+ RowData nestedRow = row.getRow(rowPos, nestedRowType.getFieldCount());
+ return RowDataProjection
+ .create(nestedRowType, rowField.type().asStructType(), projectField.type().asStructType())
+ .wrap(nestedRow);
+ }
};
case MAP:| private static RowData.FieldGetter createFieldGetter(RowType rowType, | ||
| Types.StructType rowStruct, | ||
| Types.NestedField projectField) { | ||
| for (int i = 0; i < rowStruct.fields().size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a little performance concern of n^2 complexity for table with a lot of columns (like thousands or more)
I'm fine with either. Because the complexity is actually n*m, let's say the n is the table's field number and m is the projection fields number. If both @stevenzwu and @kbendick think it's necessary to do, I'm okay with it.
| private static RowData.FieldGetter createFieldGetter(RowType rowType, | ||
| Types.StructType rowStruct, | ||
| Types.NestedField projectField) { | ||
| for (int i = 0; i < rowStruct.fields().size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a little performance concern of n^2 complexity for table with a lot of columns (like thousands or more)
I'm fine with either. Because the complexity is actually n*m, let's say the n is the table's field number and m is the projection fields number. If both @stevenzwu and @kbendick think it's necessary to do, I'm okay with it.
| boolean valueProjectable = !projectedMap.valueType().isNestedType() || | ||
| projectedMap.valueType().equals(originalMap.valueType()); | ||
| Preconditions.checkArgument(keyProjectable && valueProjectable, | ||
| "Cannot project a partial map key or value RowData. Trying to project %s out of %s", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should say Cannot project a partial map key or value with non-primitive type, Trying .., the assert failure does not mean it's necessary to be a RowData, it can be other data types such as list or map etc.
|
I adressed some comment and leave some comment to discuss. I think you can take another looks of this PR. @rdblue @openinx @stevenzwu @kbendick 😄 |
|
Let me take another look today ! Thanks @Reo-LEI for the updating. |
| /** | ||
| * Creates a projecting wrapper for {@link RowData} rows. | ||
| * <p> | ||
| * This projection does not work with repeated types like lists and maps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This projection does not work with repeated types like lists and maps with nested children types ? I think it works fine for lists/maps with primitive children types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to say: Projecting a partial map key or value with non-primitive type does not work in this projection wrapper
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The meaning of this comment is exactly what you said that the projection will not project the nested children types of repeated types. I will rephrase it.
| /** | ||
| * Creates a projecting wrapper for {@link RowData} rows. | ||
| * <p> | ||
| * This projection does not work with repeated types like lists and maps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto.
| if (rowField == null) { | ||
| throw new IllegalArgumentException(String.format( | ||
| "Cannot locate the project field <%s> in the iceberg struct <%s>", projectField, rowStruct)); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
Preconditions.checkNotNull(rowField,
"Cannot locate the project field <%s> in the iceberg struct <%s>", projectField, rowStruct);
openinx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me overall, thanks @Reo-LEI for the great contribution and thanks @stevenzwu for the double check. I left several minor comments.
|
I addressed the rest of comments just now, and you can check this again @openinx. And thanks @openinx @stevenzwu @kbendick @rdblue for review, |
|
Sorry for missing some pings. Was out of office for a few weeks a while back and have still been playing a bit of catch up. Please feel free to message me on slack if it's urgent btw. But retroactive +1. 🙂 |
This PR is completed on the basis of #2731 and trying to fixes #2730. Thanks for the contribution of @openinx.
In this PR, I make
RowDataProjectionas row data wrapper as this comment #2731 (comment) mentioned and supprot theMapandListtype projection.